Dataset of smart heat and water meter data with accompanying building characteristics

The data presented were sourced from 34,884 commercial smart heat meters and 10,765 commercial smart water meters, spanning a timeframe of up to 5 years (2018–2022). All data primarily originated from single-family houses in Aalborg Municipality, Denmark. Furthermore, comprehensive building characteristics were collected for each building, where available, from the Danish Building and Dwelling Register (BBR) and Energy Performance Certificate (EPC) input data. This effort yielded an extensive pool of up to 86 distinct characteristics per building. All smart meter data were processed employing a well-established methodology, resulting in equidistant hourly data without any erroneous or missing values. The building characteristics derived from the EPCs were additionally filtered using rule sets to improve the data quality. This dataset holds substantial value for researchers involved in the domains of the built environment, district heating, and water sectors.


Specifications
Direct URL to data: https://vbn.aau.dk/en/datasets/dataset-of-smart-heat-and-water-meter-data-with-accompanying-buil [1]

Instructions for accessing these data: Part of the data was originally collected for billing purposes (hourly data from smart heat and water meters) and made available to the authors for scientific purposes via a data use agreement on the legal basis of GDPR article 89. The data were anonymised by the researchers. However, as the data can potentially be deanonymised in combination with the building characteristics through a backward search in the public Danish Building and Dwelling Register, the data are considered personal data subject to the GDPR. Researchers interested in using the data should contact the corresponding author (Anna Marszal-Pomianowska) and are required to complete a joint Data Use Agreement to document that the data sharing is lawful. For researchers outside the European Union, additional requirements may apply in accordance with applicable Danish and European law.
Once the agreement has been approved, the data, which are stored in a PostgreSQL database, can be accessed via an API that requires authentication via eduGAIN.

Value of the Data
• This dataset provides an unprecedented amount of data, particularly in conjunction with accompanying building information at a high level of detail. The easy and clearly documented accessibility of the data makes it useful for small- and large-scale research.
• The data can be of great value for research in the built environment, district heating, and water sectors. They provide numerous opportunities for data-driven research and validation of models.
• Within building-related research, the dataset allows for deepening current knowledge on the use of heat energy in single-family houses, refining fault detection methods, and validating urban building energy models.
• In the field of district heating, the dataset facilitates research in demand response and load shifting, contributing to optimising district heating systems for increased responsiveness and sustainability.
• Together with the building information provided, high-resolution water data could provide valuable insight into the drivers of water use. They have the potential to uncover large-scale consumption patterns, providing a foundation for more effective water resource management strategies.
• The unique combination of high-resolution water and energy data on such a large scale opens new research possibilities focused on, for example, the separation of energy use for heating and domestic hot water.

Data Description
The data are structured within six tables in a database. An entity relationship diagram is shown in Fig. 1. All data can be related, which is the core idea of the whole database. The meter ID is unique for all processed data and can be used as an identifier. For the raw smart meter data, there may be meters that are incorrectly assigned to two customers, so the uniqueness of the ID is not guaranteed for the raw data. The customer ID can be used to link Smart Heat Meter (SHM) and Smart Water Meter (SWM) data. A customer can have one or more meters. For this reason, there may be duplicate entries in the Danish Building and Dwelling Register (BBR) data, differing only in the meter ID; e.g., if a customer has one SHM but two SWMs, then there are two entries, identical except for the SWM ID (both entries have the same SHM ID). As EPCs are only valid for 10 years, and due to the validity criteria outlined in Section 3.4, the EPC data depend on the data period. Consequently, there may be several identical entries for the same building, e.g., one for the SWM data, one for the SHM data, or several for the SHM data if the SHM data have several periods. Fig. 2 gives an overview of the number of meters for which the respective data (processed SHM and SWM data) are available in the database. In the following, each table is described separately.
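The linking logic described above can be illustrated with a minimal sketch. The field names customer_id, heat_meter_id and water_meter_id, as well as all IDs, are assumptions made for illustration; the actual column names are those documented in Tables 1 to 4.

```python
from collections import defaultdict

# Hypothetical meter-to-customer assignments (all IDs are made up).
heat_meters = [{"heat_meter_id": "shm_a", "customer_id": "cust_1"}]
water_meters = [
    {"water_meter_id": "swm_x", "customer_id": "cust_1"},
    {"water_meter_id": "swm_y", "customer_id": "cust_1"},
]


def link_meters(heat_meters, water_meters):
    """Pair every SHM with every SWM that belongs to the same customer."""
    water_by_customer = defaultdict(list)
    for row in water_meters:
        water_by_customer[row["customer_id"]].append(row["water_meter_id"])

    links = []
    for row in heat_meters:
        for water_id in water_by_customer.get(row["customer_id"], []):
            links.append((row["customer_id"], row["heat_meter_id"], water_id))
    return links


links = link_meters(heat_meters, water_meters)
for link in links:
    print(link)
```

One SHM and two SWMs for the same customer yield two linked rows that differ only in the SWM ID, which mirrors the duplicate BBR entries mentioned in the text.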

Smart heat meter data

Raw data
This table contains the data as collected by the SHMs installed in the respective buildings in the Municipality of Aalborg, Denmark. An overview of all columns included in this dataset is given in Table 1. The data span from the beginning of 2018 to the end of 2022 (with different lengths for each building) and contain data from a total of 34,884 SHMs (9.46 × 10^8 rows). Data from a building may not be complete; that is, a building may have data from 2018 and 2020 but no data from 2019. The data have not been processed in any way other than by eliminating redundant columns of units of measurement to reduce the amount of storage space required. As the data are not processed, they are not exactly hourly, as the SHMs have a temporal accuracy of ±30 min around the full hour. In addition, the original data were delivered to the researchers with a timestamp in local time (CET/CEST) but without any time zone information. On the day when summer time ends, the hour between 2 and 3 o'clock exists twice, once in summer time (CEST) and once in standard time (CET). Consequently, for readings within this hour, it cannot be determined whether they originate from CEST or CET. The data contain missing values due to errors in the transmission infrastructure used to collect the data.
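The time zone ambiguity described above can be reproduced with Python's standard library. This is only an illustration of the problem, not part of the processing framework; the date and the time zone name Europe/Copenhagen are chosen here as an example, and the zoneinfo module needs time zone data to be available on the system.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

COPENHAGEN = ZoneInfo("Europe/Copenhagen")

# Summer time ended on 30 October 2022, so 02:30 local time occurred twice.
naive = datetime(2022, 10, 30, 2, 30)

# fold=0 selects the first occurrence (CEST), fold=1 the second one (CET).
first = naive.replace(tzinfo=COPENHAGEN, fold=0)
second = naive.replace(tzinfo=COPENHAGEN, fold=1)

first_utc = first.astimezone(timezone.utc)
second_utc = second.astimezone(timezone.utc)

print(first.tzname(), first_utc.isoformat())
print(second.tzname(), second_utc.isoformat())
```

The same local timestamp maps to two UTC instants one hour apart, so a reading stored without time zone information cannot be assigned to either of them.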

Processed data
The processed data table contains the processed data from the SHMs. It contains data from 34,795 SHMs (9.33 × 10^8 rows), and an overview of all available columns is given in Table 1. These data are equidistant, have no erroneous values (in terms of transmission errors or incorrect meter assignment), and missing values have been imputed. The processing used is described in detail in Section 3.1. Fig. 3 shows the number of processed SHMs available for the different years of the data period.

Smart water meter data

Raw data
This table contains the data as collected via the SWMs installed in the respective buildings in Aalborg Municipality, Denmark. An overview of all columns included in this dataset is given in Table 2. The dataset covers the period from the beginning of May 2021 to the end of 2022 (with different lengths for each building) and contains data from a total of 10,765 SWMs (7.19 × 10^7 rows). The data have not been processed in any way other than by removing redundant columns containing units of measurement to reduce the amount of storage required. As the data have not been processed, they are not exactly hourly, as the SWMs have a temporal accuracy of ±30 min around the full hour. The data have been supplied with UTC timestamps, so unlike the SHM data, the timestamp is always unambiguous.

Processed data
The processed data table contains the processed data from the SWMs. It contains data from 10,510 SWMs (7.04 × 10^7 rows), and an overview of all available columns is given in Table 2. These data are equidistant, have no erroneous values (in terms of transmission errors or incorrect meter assignment), and missing values have been imputed. The processing used is described in detail in Section 3.2.

Statistical building characteristics (BBR)
For each building for which either SHM or SWM data are available in this dataset, the corresponding data from the BBR have been collected where possible. This publicly available database contains information on every building in Denmark and is operated by the Danish Customs and Tax Administration. An overview of the available columns is given in Table 3. The process of collecting the data is described in Section 3.3.

Detailed building characteristics (EPC)
For each building for which either SHM or SWM data are available in this dataset, the input data from the corresponding Energy Performance Certificate (EPC), if available, were collected and processed from the EPC database developed by Brøgger and Wittchen, 2016 [3] and hosted at Aalborg University. An overview of the available columns is given in Table 4. The processing used to derive the data is described in Section 3.4.

Smart heat meter data processing
The SHM data were obtained by the authors from the local utility company as .csv files. As mentioned above, the readings were provided in local time (CET/CEST) without any time zone information. As the dataset is similar to the one described in detail by Schaffer et al. [5], a similar cleaning and imputation framework was applied to obtain equidistant data without erroneous or missing values. The only difference from the framework described in Schaffer et al. [5] is that, due to the long data period and the higher uncertainty in data quality, each smart meter was required to have at least 8584 h of data per year (approximately 2% missing data). If a year did not meet this threshold, only the year in question was excluded. Thus, an SHM may have data in non-consecutive years in the processed data, and these data sequences can be considered separate data. For this reason, the period column (Table 1) has been introduced. This column, starting at one, indicates which sequence the SHM data belong to; e.g., if an SHM has data in 2018 and 2020-2022 but no data in 2019, the period column is 1 for all data in 2018 and 2 for all data in 2020-2022.

In addition to this basic data treatment, the SPMS method developed by Schaffer et al. [2] was applied to energy use. SPMS was developed to reduce the error introduced by rounding the raw cumulative energy data to integer values. The result of this process is available as a separate column (heat_energy_kwh_spms) in the processed data (Table 1).

Table 4 (continued)

The temperature factor is a fraction between 0 and 1 that accounts for the fact that the outside of a building component may face a temperature other than the outside air temperature, or that the inside of a component may face a temperature other than the room temperature. It is 1.0 for the 'standard' case; 0.7 is commonly used for cases such as a ground deck without underfloor heating or exterior basement walls deeper than 2 m.

Opaque envelope
opaque_heatloss_total (W): Total heat losses through the opaque envelope, taking the dimensioning temperature into account, calculated according to Eq. 2. For an explanation of the temperature factor, see Eq. 1. The dimensioning temperatures are based on the Danish standard DS 418:2011. Standard values are 20 °C for the interior, 30 °C for the interior of a floor with floor heating, -12 °C for the exterior, and 10 °C for exterior elements against soil deeper than 2 m.

Note to Eq. 4: For an explanation of the temperature factor, see Eq. 1. For an explanation of the dimensioning temperatures, see Eq. 2.

window_solar_north (m²): Total solar factor of all windows facing north (orientation > 315° OR orientation <= 45°), calculated as:
\[ \sum_{n=1}^{i} \text{nr of windows}_n \times \text{area}_n \times \text{g value}_n \times \text{glass share}_n \times \text{shading factor}_n \tag{5} \]
The shading factor was calculated from the angles to the shading objects of each window, based on the simplified method stated in [4]. For objects shading from the side as well as for overhangs, an infinite height and length were assumed. Shading from the wall thickness could not be considered, as the simplified method is based on the wall thickness, which is not an input for EPCs.

The total solar factor of all skylights was calculated as stated in Eq. 5.

Thermal bridge
thermal_bridge_kelvin (W/K): Total heat losses through thermal bridges, calculated as follows: [...]

Domestic hot water tank
Note that the Danish EPC calculation method is insensitive to the tank volume. For this reason, many buildings have a total share of domestic hot water covered by the domestic hot water tank larger than zero with a 0 L tank volume.
dhw_tank_heat_loss (W/K): Total heat losses from domestic hot water tanks, calculated as stated in Eq. 9.

Internal gains
Default values for internal heat gains from occupants are 1.5 W/m², but at most 360 W, for residential buildings and 4 W/m² for non-residential buildings.
gains_device (W): Total heat gains from appliances inside usage hours, calculated as follows:
\[ \sum_{n=1}^{i} \text{area}_n \times \text{appliances heat gains per area}_n \tag{15} \]
gains_device_outside (W): Total heat gains from appliances outside usage hours, calculated as stated in Eq. 15.

Heating system
heating_supply_temp (°C): Supply temperature of the heat distribution system.
heating_return_temp (°C): Return temperature of the heat distribution system.
heating_pipes (W/K): Total heat losses through heating pipes, calculated as stated in Eq. 16.

Ventilation system
Temperature efficiency refers to the efficiency of the heat recovery.
vent_inlet_temperature_code (nominal): Categorisation of ventilation, heat recovery and heating coil, based on the maximum vent_mech_winter for the first three categories. If vent_mech_winter is zero, "Type 4" is selected.
• Type 1 = ventilation system with temperature-controlled heat recovery (and temperature-controlled heating coil)
• Type 2 = ventilation system with NOT temperature-controlled heat recovery and temperature-controlled heating coil
• Type 3 = ventilation system with NOT temperature-controlled heat recovery and NO (temperature-controlled) heating coil
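The yearly completeness check and the period column described for the smart heat meter processing can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the input format (a count of available hourly readings per year) and the example counts are assumptions; only the 8584 h threshold and the numbering of periods come from the text.

```python
MIN_HOURS_PER_YEAR = 8584  # threshold from the text, about 98% of 8760 h


def assign_periods(hours_per_year):
    """Map each retained year to a period number starting at 1.

    hours_per_year: dict {year: number of hourly readings available}.
    Years below the threshold are dropped; a new period starts whenever
    the retained years are not consecutive.
    """
    kept = sorted(
        year for year, hours in hours_per_year.items()
        if hours >= MIN_HOURS_PER_YEAR
    )
    periods = {}
    period = 0
    previous = None
    for year in kept:
        if previous is None or year != previous + 1:
            period += 1  # gap in the sequence of years: open a new period
        periods[year] = period
        previous = year
    return periods


# Hypothetical meter: 2019 has too few readings and is excluded.
example = assign_periods(
    {2018: 8700, 2019: 6000, 2020: 8760, 2021: 8750, 2022: 8600}
)
print(example)
```

For this hypothetical meter, 2018 forms period 1 and 2020-2022 form period 2, matching the example given in the text.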

Smart water meter data processing
The authors obtained the SWM data from the local utility company as .csv files. The data were provided with readings in UTC. Given the same nature of the data (cumulative and approximately hourly), the same cleaning and imputation framework as for the SHM data was used to process the SWM data. However, given the varying data period, the threshold for missing values was set at 2% for each SWM individually, based on the first and last recorded value, to account for the different lengths of the datasets. SWMs with more than 2% missing values were excluded.
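The per-meter missing-value criterion can be illustrated with a short sketch. It is written under stated assumptions rather than taken from the processing framework: readings are assigned to their nearest full hour (motivated by the ±30 min accuracy), and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def nearest_hour(ts):
    """Assign a reading to the nearest full hour (assumed convention)."""
    shifted = ts + timedelta(minutes=30)
    return shifted.replace(minute=0, second=0, microsecond=0)


def missing_share(timestamps):
    """Share of hourly slots without a reading between first and last reading."""
    hours = sorted({nearest_hour(ts) for ts in timestamps})
    expected = int((hours[-1] - hours[0]).total_seconds() // 3600) + 1
    return 1 - len(hours) / expected


def keep_meter(timestamps, max_missing=0.02):
    """A meter is kept if at most 2% of its hourly slots are missing."""
    return missing_share(timestamps) <= max_missing


# Hypothetical meter with 100 hourly slots, one of which has no reading.
start = datetime(2021, 5, 1)
all_hours = [start + timedelta(hours=k) for k in range(100)]
one_gap = [ts for k, ts in enumerate(all_hours) if k != 50]
five_gaps = [ts for k, ts in enumerate(all_hours) if k not in range(40, 45)]

print(keep_meter(one_gap), keep_meter(five_gaps))
```

Because the expected number of readings is derived from each meter's own first and last reading, meters with short and long data periods are judged by the same relative criterion.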

BBR data processing
The address was the only customer information provided by the utility company to link SHM and SWM data to a building/unit. It was unclear whether the address referred to a unit (e.g., an apartment) or a building (e.g., an apartment building). The address was used to retrieve the building characteristics from the BBR database. To prevent incorrect information from influencing the retrieval of building characteristics, the address information provided was treated with the Address Cleaning API, which is part of the Danish Address Web API (DAWA) [6]. This API can translate unstructured addresses with possible misspellings into official addresses. In addition to the address information, the API returns the certainty of the match expressed in three levels: A - identical match, B - certain match, and C - uncertain match. Only results with a certainty of A or B were considered valid. As the Address Cleaning API distinguishes between unit and building addresses, all addresses were initially treated as unit addresses, and only addresses with a certainty of C were subsequently treated as building addresses. Addresses for which neither a unit nor a building address could be found with high certainty (level A or B) were excluded.
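The decision logic for unit and building addresses can be summarised in a short sketch. No real API is called here: clean_as_unit and clean_as_building are hypothetical stand-ins for the address cleaning service that return only the match category, and the example addresses are made up.

```python
VALID_CATEGORIES = {"A", "B"}


def resolve_address(address, clean_as_unit, clean_as_building):
    """Return 'unit', 'building' or None (excluded) for one address.

    clean_as_unit and clean_as_building are hypothetical callables that
    return the match category 'A', 'B' or 'C' for the given address.
    """
    # Every address is first treated as a unit address.
    if clean_as_unit(address) in VALID_CATEGORIES:
        return "unit"
    # Only uncertain unit matches are retried as building addresses.
    if clean_as_building(address) in VALID_CATEGORIES:
        return "building"
    return None


# Made-up lookup tables replacing the real service.
unit_results = {"Example Street 1, 2. tv": "A", "Example Street 3": "C", "Exmpl Str": "C"}
building_results = {"Example Street 3": "B", "Exmpl Str": "C"}

resolved = {
    address: resolve_address(
        address,
        lambda a: unit_results.get(a, "C"),
        lambda a: building_results.get(a, "C"),
    )
    for address in unit_results
}
print(resolved)
```

The outcome of this step corresponds to the information stored in the bbr_resolution column (Table 3), which records whether an address was attributed to a unit or a building.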
The BBR information was obtained through Denmark's Address Web API (DAWA) [6]. Information about a unit and its building could be obtained directly through the API. For the SHMs where only a building address was available, the 'access address id' had to be retrieved via the address before information about the building could be obtained. In both cases, more than one BBR record may be obtained, for example, if two or more units/buildings have the same address. In order to allow for a data structure where an SHM can be linked to zero or one BBR record, cases where more than one record was obtained were considered invalid and consequently not included in the database. All nominal values were translated into human-understandable terms in English.
As the main objective was to establish essential building characteristics for as many SHMs as possible, only mandatory BBR information was considered for the dataset. The building owner must provide this mandatory information, and it is therefore subject to uncertainty. However, the data quality has recently been investigated [7], and it was concluded that the overall quality of the data is high and that the data quality improved from 2000 to 2013.

EPC data processing
To link the available EPC data from the EPC database developed by Brøgger and Wittchen, 2016 [3] and hosted at Aalborg University with the SHM and SWM data, the same 'cleaned' addresses as for the BBR data (Section 3.3) were used. Given the sheer amount of information available in the EPCs, it was decided to focus mainly on data from five aspects:
• Building envelope
• Domestic hot water (DHW)

The data quality of the Danish EPC has been heavily criticised in the past, as random checks have revealed errors in 20-30% of all EPCs [8]. For this reason, the cleaning framework developed by Brøgger [8] was applied. However, this framework was originally developed for the purpose of energy modelling of the building stock. Therefore, some criteria have been adapted, and some have been added to better fit the purpose of this dataset. All quality assurance criteria used can be found in the dataset repository [1].
After the cleaning step, the information obtained was aggregated to obtain the same building characteristics for each building where the information was available. The resulting columns, including a description of how they were calculated, are shown in Table 4. Only results where an EPC record could be clearly linked to one building were considered. Furthermore, only valid EPCs were considered. Validity was defined as the EPC being valid (no more than 10 years old) at least on the first day of the data period. For SHM data, each period was considered separately. Thus, if an SHM has two periods, one period may have EPC information available, and the other may not, or the information may differ between the periods. In addition, several EPCs can be valid simultaneously, as the EPCs are not invalidated when a new EPC is issued. If two EPCs are valid for an SHM or SWM, the information from the most recent EPC was used. Furthermore, if an EPC was issued during the data period of the respective SHM or SWM, all EPCs were considered invalid for this period, as it is assumed that the building has been renovated and, therefore, the data represent two different building conditions. The need for this assumption also originated from the fact that it is currently not possible to easily track the changes from one EPC to another.
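The validity rules above can be expressed as a small selection function. This is a sketch under assumptions, not the authors' code: EPCs are represented only by their issue dates, the 10-year boundary is treated as inclusive, and an EPC issued exactly on the first day of the period is treated as valid rather than as issued during the period.

```python
from datetime import date


def add_years(d, years):
    """Shift a date by whole years (29 February falls back to 28 February)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def select_epc(issue_dates, period_start, period_end):
    """Return the issue date of the EPC to use for a data period, or None."""
    # An EPC issued during the period invalidates all EPCs for that period.
    if any(period_start < issued <= period_end for issued in issue_dates):
        return None
    # Keep EPCs that are valid (at most 10 years old) on the first day.
    valid = [
        issued for issued in issue_dates
        if issued <= period_start <= add_years(issued, 10)
    ]
    # If several EPCs are valid, the most recent one is used.
    return max(valid, default=None)


start, end = date(2018, 1, 1), date(2022, 12, 31)
print(select_epc([date(2012, 6, 1), date(2016, 3, 1)], start, end))
print(select_epc([date(2012, 6, 1), date(2020, 5, 1)], start, end))
print(select_epc([date(2007, 1, 1)], start, end))
```

Applying the function once per period reproduces the behaviour described in the text, where two periods of the same SHM can end up with different EPC information or with none.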

Limitations
Despite the substantial efforts invested in mitigating the uncertainty associated with building characteristics data, it is important to acknowledge that some level of uncertainty persists. In addition, the BBR database used has no version control or modification history, so the data can only be extracted from the current version. It therefore cannot be ruled out that the data changed between the time the SHM or SWM data were recorded and the time the BBR data were retrieved.

Fig. 2. Meter ID and customer ID based number of meters available in the respective group based on the processed data.

Fig. 3. Number of processed SHMs available in the different years of the data period.
Heat losses per Kelvin through all windows facing east (orientation > 45° AND orientation <= 135°), calculated as stated in Eq. 3.
window_heatloss_east_total (W): Total heat loss through all windows facing east (orientation > 45° AND orientation <= 135°), taking the dimensioning temperature into account, calculated as stated in Eq. 4.
window_solar_east (m²): Total solar factor of all windows facing east (orientation > 45° AND orientation <= 135°), calculated as stated in Eq. 5.

Window south
window_heatloss_south_kelvin (W/K): Heat losses per Kelvin through all windows facing south (orientation > 135° AND orientation <= 225°), calculated as stated in Eq. 3.
window_heatloss_south_total (W): Total heat loss through all windows facing south (orientation > 135° AND orientation <= 225°), taking the dimensioning temperature into account, calculated as stated in Eq. 4.
window_solar_south (m²): Total solar factor of all windows facing south (orientation > 135° AND orientation <= 225°), calculated as stated in Eq. 5.

\[ \sum_{n=1}^{i} \text{nr of tanks}_n \times \text{heat loss}_n \times \text{temperature factor}_n \tag{9} \]

Domestic hot water tank
dhw_tank_sup_temp (°C): The required supply flow temperature from the central heating system to the domestic hot water tank, calculated as follows:
\[ \frac{\sum_{n=1}^{i} \text{supply temperature}_n}{n} \tag{10} \]
Because a large share of EPCs have a tank volume of 0 L, as mentioned above, the tank volume is not considered for averaging.
dhw_tank_share (-): The total share of domestic hot water covered by the domestic hot water tanks, calculated as:
\[ \sum_{n=1}^{i} \text{share of consumption}_n \tag{11} \]

\[ \sum_{n=1}^{i} \text{length}_n \times \text{heat loss}_n \times \text{temperature factor}_n \tag{13} \]

Internal gains
gains_people (W): Total heat gains from occupants, calculated as follows:
\[ \sum_{n=1}^{i} \text{area}_n \times \text{occ heat gains per area}_n \tag{14} \]

\[ \sum_{n=1}^{i} \text{length}_n \times \text{heat loss}_n \times \text{temperature factor}_n \tag{16} \]

heating_type_code (nominal): Plant type:
• 1: Single-circuit system
• 2: Double-circuit system (or parts of the installation are single circuit, and these are equipped with local mixing devices)

\[ \sum_{n=1}^{i} \text{area}_n \times \text{ventilation flow per area}_n \times \text{usage factor}_n \tag{19} \]

Solar plant
solar_plant_type_code (nominal): Type of solar plant:
• None = No solar plant (or a solar plant with 0 m² area)
• UtilityWater = only for domestic hot water
• RoomHeating = only for room heating
• Combined = Combined for room heating and domestic hot water

Heat pump
• None = No heat pump (or a heat pump with 0 area fraction)
• RoomHeating = only for room heating
• UtilityWater = only for domestic hot water
• Combined = One heat pump combined for room heating and domestic hot water
• Duo = Two heat pumps, one for room heating and one for domestic hot water
heatpump_area_fraction (-): Proportion of the total heated floor area of the building covered by the heat pump. If heat pumps supply heat to the ventilation system's supply air, a negative number indicates that there is also other heating in the rooms.

Table 1
Description of the raw and processed smart heat meter (SHM) data.

Table 2
Description of the raw and processed smart water meter (SWM) data.

Table 3
Description of data derived from the Danish building and dwelling register.

[...] ID, which functions as the key to link the data to the smart heat meter data.
water_meter_id (hash): Unique hashed meter ID, which functions as the key to link the data to the smart water meter data.
[...] bathroom exists and if the bathroom is positioned inside the unit or outside the unit.
unit_kitchen_pos_code (nominal): Information on whether a kitchen exists and whether it is positioned inside or outside the unit.
unit_energy_code (nominal): Information about which voltage of electricity is available in the unit and whether gas is available.
bbr_resolution (nominal): Information on whether the address could be attributed to a unit or a building. If the address could only be linked to a building, information about the unit is missing.

Table 4
Description of data derived from the Danish energy performance certificate.

[...], which functions as the key to link the data to the smart heat meter data.
water_meter_id (hash): Unique meter ID, which functions as the key to link the data to the smart water meter data.
period (integer): As a meter can have data for non-consecutive years, the period indicates whether the data of one meter are continuous or come from two or more separated years. A period is an integer ranging from 1 to n.

\[ \sum_{n=1}^{i} \text{area}_n \times \text{u value}_n \times \text{temperature factor}_n \tag{1} \]

Table 4 (continued)
The value was derived based on the maximum number of tanks with the respective electric heating possibility. (The volume could not be used because, as noted above, many EPCs erroneously state a 0 L tank.)
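As a worked illustration of Eq. 1 in Table 4 (the sum of area × U-value × temperature factor over all components), the following sketch computes a heat loss per Kelvin for two made-up components. The dictionary keys and all numerical values are assumptions for illustration; only the structure of the sum and the temperature factors 1.0 and 0.7 come from the table description.

```python
def heat_loss_per_kelvin(components):
    """Sum area * U-value * temperature factor over all components (W/K)."""
    return sum(
        c["area_m2"] * c["u_value"] * c["temperature_factor"]
        for c in components
    )


# Made-up components: an exterior wall ('standard' case, factor 1.0) and a
# ground deck without underfloor heating (factor 0.7).
components = [
    {"area_m2": 100.0, "u_value": 0.30, "temperature_factor": 1.0},
    {"area_m2": 80.0, "u_value": 0.20, "temperature_factor": 0.7},
]

total = heat_loss_per_kelvin(components)
print(f"{total:.1f} W/K")
```

Here the wall contributes 100 × 0.30 × 1.0 = 30.0 W/K and the ground deck 80 × 0.20 × 0.7 = 11.2 W/K, giving 41.2 W/K in total.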